The widespread presence of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter, annotated as offensive or not offensive at both the sentence level and the token level; the token-level annotations improve the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
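As an illustration of what annotation at both the sentence level and the token level can look like, here is a minimal sketch of one annotated post. The class name `AnnotatedPost`, the label strings `"OFF"` and `"NOT"`, and the rule that a non-offensive post carries no flagged tokens are assumptions made for this example, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedPost:
    """One post with a sentence-level label and token-level flags (hypothetical schema)."""
    tokens: List[str]
    label: str             # assumed label names: "OFF" (offensive) or "NOT" (not offensive)
    rationales: List[int]  # 1 if the token supports the offensive label, else 0

    def __post_init__(self) -> None:
        if self.label not in ("OFF", "NOT"):
            raise ValueError("label must be 'OFF' or 'NOT'")
        if len(self.tokens) != len(self.rationales):
            raise ValueError("one rationale flag is required per token")
        # Assumption: a post labelled not offensive has no flagged tokens.
        if self.label == "NOT" and any(self.rationales):
            raise ValueError("a 'NOT' post cannot contain flagged tokens")

    def offensive_tokens(self) -> List[str]:
        """Return the tokens flagged as contributing to the offensive label."""
        return [tok for tok, flag in zip(self.tokens, self.rationales) if flag]


# Placeholder tokens stand in for real text.
post = AnnotatedPost(tokens=["tok_a", "tok_b", "tok_c"], label="OFF", rationales=[0, 1, 0])
flagged = post.offensive_tokens()
```

Keeping the token flags aligned with the token list lets a sentence-level classifier be checked against the words a human annotator actually marked.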
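Lexical simplification (LS) is the task of automatically replacing complex words with simpler ones, making texts more accessible to various target populations (e.g., individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.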
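Multiword expressions (MWEs) are sequences of words that together convey a meaning that is not derived from their individual words. Processing MWEs is crucial in many natural language processing (NLP) applications, including machine translation and terminology extraction. Detecting MWEs in different domains is therefore an important research topic. In this paper, we explore state-of-the-art neural transformers for detecting MWEs in flower and plant names. We evaluate different transformer models on a dataset created from an encyclopedia of plants and flowers. We show empirically that transformer models outperform previous neural models based on long short-term memory (LSTM).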
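Multiword expressions (MWEs) are groups of words in which the meaning of the whole is not derived from the meaning of its parts. Processing MWEs is crucial in many natural language processing (NLP) applications, including machine translation and terminology extraction. Detecting MWEs is therefore a popular research topic. In this paper, we explore state-of-the-art neural transformers for the task of detecting MWEs. We empirically evaluate them on the dataset of SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM). We show that transformer models outperform previous neural models based on long short-term memory (LSTM). The code and pre-trained models will be made freely available to the community.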
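MWE detection is often framed as token-level tagging, where each token receives a tag and contiguous tagged tokens form an expression. The sketch below decodes such tags into spans. It assumes a simplified three-tag B/I/O scheme and a hypothetical helper name `bio_to_spans`; it is not the tag set or code used in the paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert B/I/O tags into half-open (start, end) token spans.

    Assumed scheme: "B" opens an expression, "I" continues it, "O" is outside.
    A stray "I" with no preceding "B" is tolerated and treated as an opening tag.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # close the previous span first
            start = i
        elif tag == "I":
            if start is None:
                start = i
        else:  # "O" or any other tag closes the open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))  # span running to the end of the sentence
    return spans


tokens = ["He", "kicked", "the", "bucket", "yesterday"]
tags = ["O", "B", "I", "I", "O"]
expressions = [" ".join(tokens[s:e]) for s, e in bio_to_spans(tags)]
```

With the tags above, the only decoded span covers "kicked the bucket". A tagger built on a transformer would produce the tag sequence; this function only handles the decoding step.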
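The widespread presence of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary to support the moderation process at online platforms. Evaluating these identification tools requires continuous experimentation with datasets in different languages. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The dataset was assembled from Twitter. The subtrack has two subtasks. Task A is a binary classification problem (hate and offensive versus not offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes (hate speech, offensive, and profanity) offered for English and Hindi. Overall, participating teams submitted 652 runs. The best classification algorithms for Task A achieved scores of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best-performing algorithms were mainly variants of transformer architectures.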
We present a new algorithm to learn a deep neural network model robust against adversarial attacks. Previous algorithms demonstrate that an adversarially trained Bayesian Neural Network (BNN) provides improved robustness. We recognize that the adversarial learning approach for approximating the multi-modal posterior distribution of a Bayesian model can lead to mode collapse; consequently, the model's achievements in robustness and performance are sub-optimal. Instead, we first propose preventing mode collapse to better approximate the multi-modal posterior distribution. Second, based on the intuition that a robust model should ignore perturbations and only consider the informative content of the input, we conceptualize and formulate an information gain objective to measure and force the information learned from both benign and adversarial training instances to be similar. Importantly, we prove and demonstrate that minimizing the information gain objective allows the adversarial risk to approach the conventional empirical risk. We believe our efforts provide a step toward a basis for a principled method of adversarially training BNNs. Our model demonstrates significantly improved robustness (up to 20%) compared with adversarial training and Adv-BNN under PGD attacks with 0.035 distortion on both the CIFAR-10 and STL-10 datasets.
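For readers unfamiliar with the PGD attack mentioned above, the following is a minimal sketch of projected gradient descent against a toy classifier. It assumes that the 0.035 distortion is an L-infinity bound on inputs scaled to [0, 1], and it uses a logistic-regression model as a stand-in; the function names, step size, and iteration count are illustrative choices, not the paper's setup.

```python
import numpy as np


def bce_loss(x: np.ndarray, y: float, w: np.ndarray, b: float) -> float:
    """Binary cross-entropy of a logistic-regression model on a single input."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def pgd_attack(x, y, w, b, eps=0.035, step=0.01, iters=10):
    """L-infinity PGD against the toy model (no random start, for simplicity)."""
    x_adv = x.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))  # predicted P(y = 1)
        grad = (p - y) * w                          # d(loss)/d(input) for this model
        x_adv = x_adv + step * np.sign(grad)        # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid input range
    return x_adv


x = np.array([0.5, 0.5, 0.5])
w = np.array([1.0, -2.0, 0.5])
x_adv = pgd_attack(x, y=1.0, w=w, b=0.0)
```

The projection step is what enforces the distortion budget: however many gradient steps are taken, the perturbed input never moves more than `eps` away from the original in any coordinate.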
Artificial Intelligence (AI) and its data-centric branch of machine learning (ML) have evolved greatly over the last few decades. However, as AI is used increasingly in real-world use cases, the interpretability and accessibility of AI systems have become major research areas. The lack of interpretability of ML-based systems is a major hindrance to the widespread adoption of these powerful algorithms. The reasons include ethical and regulatory concerns, which have resulted in poorer adoption of ML in some areas. The recent past has seen a surge in research on interpretable ML. Generally, designing an ML system requires good domain understanding combined with expert knowledge. New techniques are emerging to improve ML accessibility through automated model design. This paper reviews work done to improve the interpretability and accessibility of machine learning in the context of global problems that are also relevant to developing countries. We review work under multiple levels of interpretability, including scientific and mathematical interpretation, statistical interpretation, and partial semantic interpretation. The review includes applications in three areas, namely food processing, agriculture, and health.
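Traffic light detection is essential for self-driving cars to navigate safely in urban areas. Publicly available traffic light datasets are inadequate for developing algorithms that detect distant traffic lights, which provide important navigation information. We introduce a novel benchmark traffic light dataset captured using a pair of narrow-angle and wide-angle cameras covering urban and semi-urban roads. We provide 1032 images for training and 813 synchronized image pairs for testing. Additionally, we provide synchronized video pairs for qualitative analysis. The dataset includes images at a resolution of 1920$\times$1080 covering 10 different classes. Furthermore, we propose a post-processing algorithm for combining the outputs of the two cameras. Results show that our technique can strike a balance between speed and accuracy compared with the conventional approach of using a single camera frame.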
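Graph Neural Networks (GNNs) have achieved tremendous success in many graph mining tasks, benefiting from the message passing strategy that fuses local structure and node features for better graph representation learning. Despite this success, and similar to other types of deep neural networks, GNNs have been found to be vulnerable to unnoticeable perturbations of both graph structure and node features. Many adversarial attacks have been proposed to expose the fragility of GNNs under different perturbation strategies for creating adversarial examples. However, the vulnerability of GNNs to successful backdoor attacks was shown only recently. In this paper, we disclose the TRAP attack, a transferable graph backdoor attack. The core attack principle is to poison the training dataset with perturbation-based triggers, which can lead to an effective and transferable backdoor attack. The perturbation trigger for a graph is generated by performing perturbation actions on the graph structure, guided by a gradient-based score matrix from a surrogate model. Compared with prior work, the TRAP attack differs in several ways: i) it exploits a surrogate Graph Convolutional Network (GCN) model to generate perturbation triggers for a black-box backdoor attack; ii) it generates sample-specific perturbation triggers that have no fixed pattern; and iii) in the context of GNNs, the attack transfers to different GNN models when they are trained with the forged poisoned training dataset. Through extensive evaluations on four real-world datasets, we demonstrate the effectiveness of the TRAP attack in building transferable backdoors in four different popular GNNs.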
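We present the problem of two-terminal source coding with Common Sum Reconstruction (CSR). Consider two terminals, each with access to one of two correlated sources. Both terminals want to reconstruct the sum of the two sources under some average distortion constraint, and the reconstructions at the two terminals must be identical with high probability. In this paper, we develop inner and outer bounds on the achievable rate-distortion region of the CSR problem for a doubly symmetric binary source. We employ existing achievability results for Steinberg's common reconstruction and Wyner-Ziv source coding, and an achievability result for Korner-Marton's modulo-two sum computation problem.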
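For reference, the setting above can be written out in symbols. The doubly symmetric binary source below follows its standard definition; the crossover parameter $p$, the distortion measure $d$, the distortion level $D$, the block length $n$, and all symbol names are notation assumed here for illustration and may differ from the paper's.

```latex
% Doubly symmetric binary source (standard definition; notation assumed)
\[
X \sim \mathrm{Bernoulli}(1/2), \qquad
Y = X \oplus N, \qquad
N \sim \mathrm{Bernoulli}(p) \ \text{independent of } X .
\]
% The quantity both terminals reconstruct is the modulo-two sum
\[
Z = X \oplus Y = N .
\]
% Terminal k forms \hat{Z}_k^n, subject to an average distortion constraint
% and agreement of the two reconstructions with high probability
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\bigl[ d(Z_i, \hat{Z}_{k,i}) \bigr] \le D
\quad (k = 1, 2),
\qquad
\Pr\bigl[ \hat{Z}_1^n \neq \hat{Z}_2^n \bigr] \to 0 \ \text{as } n \to \infty .
\]
```

Under this definition the sum $Z$ equals the noise $N$, so it is a $\mathrm{Bernoulli}(p)$ sequence that is independent of $X$.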